Introduction


For the past two centuries or more, a variety of devices capable of generating artifi- 
cial or synthetic speech have been developed and used to investigate phonetic phenom- 
ena. The aim of this chapter is to provide a brief history of synthetic speech systems, 
including mechanical, electrical, and digital types. The primary goal, however, is not 
to reiterate the details of constructing specific synthesizers but rather to focus on the 
motivations for developing various synthesis paradigms and illustrate how they have 
facilitated research in phonetics. 


The mechanical and electro-mechanical era 


On the morning of December 20, 1845, a prominent American scientist attended 
a private exhibition of what he would later refer to as a “wonderful invention.” The 
scientist was Joseph Henry, an expert on electromagnetic induction and the first Sec- 
retary of the Smithsonian Institution. The “wonderful invention” was a machine that 
could talk, meticulously crafted by a disheveled 60-year-old tinkerer from Freiburg, 
Germany named Joseph Faber. Their unlikely meeting in Philadelphia, Pennsylvania, 
arranged by an acquaintance of Henry from the American Philosophical Society, might 
have occurred more than a year earlier had Faber not destroyed a previous version of
his talking machine in a bout of depression and intoxication. Although he had spent 
some 20 years perfecting the first device, Faber was able to reconstruct a second version 
of equal quality in a year’s time (Patterson, 1845). 

The layout of the talking machine, described in a letter from Henry to his col- 
league H.M. Alexander, was like that of a small chamber organ whose keyboard was 
connected via strings and levers to mechanical constructions of the speech organs. A 
carved wooden face was fitted with a hinged jaw, and behind it was an ivory tongue that 
was moveable enough to modulate the shape of the cavity in which it was housed. A 
foot-operated bellows supplied air to a rubber glottis whose vibration provided the raw 
sound that could be shaped into speech by pressing various sequences or combinations 
of 16 keys available on a keyboard. Each key was marked with a symbol representing 
an “elementary” sound that, through its linkage to the artificial organs, imposed time- 
varying changes to the air cavity appropriate for generating apparently convincing 
renditions of connected speech. Several years earlier Henry had been shown a talking 
machine built by the English scientist Charles Wheatstone, but he noted that Faber’s 
machine was far superior because instead of uttering just a few words, it was “capable of 
speaking whole sentences composed of any words what ever” (Rothenberg et al., 1992, 
p. 362). 

In the same letter, Henry mused about the possibility of placing two or more of 
Faber’s talking machines at various locations and connecting them via telegraph lines. 
He thought that with “little contrivance” a spoken message could be coded as keystrokes 
in one location which, through electromagnetic means, would set into action another 
of the machines to “speak” the message to an audience at a distant location. Another 
30 years would pass before Alexander Graham Bell demonstrated his invention of the 
telephone, yet Henry had already conceived of the notion while witnessing Faber’s ma- 
chine talk. Further, unlike Bell’s telephone, which transmitted an electrical analog of 
the speech pressure wave, Henry’s description alluded to representing speech in com- 
pressed form based on slowly varying movements of the operator’s hands, fingers, and 
feet as they formed the keystroke sequences required to produce an utterance, a sig- 
nal processing technique that would not be implemented into telephone transmission 
systems for nearly another century. 

It is remarkable that, at this moment in history, a talking machine had been con- 
structed that was capable of transforming a type of phonetic representation into a sim- 
ulation of speech production, resulting in an acoustic output heard clearly as intelligible 
speech - and this same talking machine had inspired the idea of electrical transmission 
of low-bandwidth speech. The moment is also ironic, however, considering that no
one seized either as an opportunity for scientific or technological advancement. Henry 
understandably continued on with his own scientific pursuits, leaving his idea to one 
short paragraph in an obscure letter to a colleague. In need of funds, Faber signed on 
with the entertainment entrepreneur P.T. Barnum in 1846 to exhibit his talking machine
for a run of several months at the Egyptian Hall in London. In his autobiography,
Barnum (1886) noted that a repeat visitor to the exhibition was the Duke of Wellington,
whom Faber eventually taught to “speak” both English and German phrases with the
machine (Barnum, 1886, p. 134). In the exhibitor’s autograph book, the Duke wrote 
that Faber’s “Automaton Speaker” was an “extraordinary production of mechanical ge- 
nius.” Other observers also noted the ingenuity in the design of the talking machine 
(e.g., “The Speaking Automaton,” 1846; Athenaeum, 1846), but to Barnum’s puzzle- 
ment it was not successful in drawing public interest or revenue. Faber and his machine 
were eventually relegated to a traveling exhibit that toured the villages and towns of 
the English countryside; it was supposedly here that Faber ended his life by suicide, 
although there is no definitive account of the circumstances of his death (Altick, 1978). 
In any case, Faber disappeared from the public record, although his talking machine 
continued to make sideshow-like appearances in Europe and North America over the 
next 30 years; it seems a relative (perhaps a niece or nephew) may have inherited the 
machine and performed with it to generate income (“Talking Machine,” 1880; Altick, 
1978). 

Although the talking machine caught the serious attention of those who understood 
the significance of such a device, the overall muted interest may have been related to 
Faber’s lack of showmanship, the German accent that was present in the machine’s 
speech regardless of the language spoken, and perhaps the fact that Faber never pub- 
lished any written account of how the machine was designed or built - or maybe a 
mechanical talking machine, however ingenious its construction, was, by 1846, sim- 
ply considered passé. Decades earlier, others had already developed talking machines 
that had impressed both scientists and the public. Most notable were Christian Gottlieb 
Kratzenstein and Wolfgang von Kempelen, both of whom had independently devel- 
oped mechanical speaking devices in the late 18th century. 

Inspired by a competition sponsored by the Imperial Academy of Sciences at St. 
Petersburg in 1780, Kratzenstein submitted a report that detailed the design of five or- 
gan pipe-like resonators that, when excited with the vibration of a reed, produced the 
vowels /a, e, i, o, u/ (Kratzenstein, 1781). Although their shape bore little resemblance 
to human vocal tract configurations, and they could produce only sustained sounds, 
the construction of these resonators won the prize and marked a shift toward scientific
investigation of human sound production. Kratzenstein, who at the time was a
Professor of Physics at the University of Copenhagen, had shared a long-term interest 
in studying the physical nature of speaking with a former colleague at St. Petersburg, 
Leonhard Euler, who likely proposed the competition. Well known for his contribu- 
tions to mathematics, physics, and engineering, Euler wrote in 1761 that “all the skill 
of man has not hitherto been capable of producing a piece of mechanism that could 
imitate [speech]” (p. 78) and further noted that “The construction of a machine capable 
of expressing sounds, with all the articulations, would no doubt be a very important 
discovery” (Euler, 1761, p. 79). He envisioned such a device to be used in assistance of 
those “whose voice is either too weak or disagreeable” (Euler, 1761, p. 79). 

During the same time period, von Kempelen - a Hungarian engineer, industrialist, 
and government official - used his spare time and mechanical skills to build a talking 
machine far more advanced than the five vowel resonators demonstrated by Kratzen- 
stein. The final version of his machine was to some degree a mechanical simulation of 
human speech production. It included a bellows as a “respiratory” source of air pressure 
and air flow, a wooden “wind” box that emulated the trachea, a reed system to gener- 
ate the voice source, and a rubber funnel that served as the vocal tract. There was an 
additional chamber used for nasal sounds, and other control levers that were needed for 
particular consonants. Although it was housed in a large box, the machine itself was 
small enough that it could have been easily held in the hands. Speech was produced by 
depressing the bellows, which caused the “voice” reed to vibrate. The operator then 
manipulated the rubber vocal tract into time-varying configurations that, along with 
controlling other ports and levers, produced speech at the word level, but could not 
generate full sentences due to the limitations of air supply and perhaps the complexity 
of controlling the various parts of the machine with only two hands. The sound quality 
was child-like, presumably due to the high fundamental frequency of the reed and the 
relatively short rubber funnel serving as the vocal tract. In an historical analysis of von 
Kempelen’s talking machine, Dudley and Tarnoczy (1950) note that this quality was 
probably deliberate because a child’s voice was less likely to be criticized when demonstrating
the function of the machine. Von Kempelen may have been particularly sensitive
to criticism considering that he had earlier constructed and publicly demonstrated a
chess-playing automaton that was in fact a hoax (cf. Carroll, 1975). Many observers
initially assumed that his talking machine was merely a fake as well. 

Von Kempelen’s lasting contribution to phonetics is his prodigious written account of
not only the design of his talking machine, but also the nature of speech and language 
in general (von Kempelen, 1791). In “On the Mechanism of Human Speech” [English
translation], he describes the experiments that consumed more than 20 years and clearly 
showed the significance of using models of speech production and sound generation to 
study and analyze human speech. This work motivated much subsequent research on 
speech production, and to this day still guides the construction of replicas of his talking 
machine for pedagogical purposes (cf. Trouvain and Brackhane, 2011).

One person particularly inspired by von Kempelen’s work was, in fact, Joseph Faber. 
According to a biographical sketch (Wurzbach, 1856), while recovering from a serious 
illness in about 1815, Faber happened onto a copy of “On the Mechanism of Human 
Speech” and became consumed with the idea of building a talking machine. Of course, 
he built not a replica of von Kempelen’s machine, but one with a significantly advanced 
system of controlling the mechanical simulation of speech production. As remarkable 
as Faber’s machine seems to have been regarded by some observers, Faber was indeed 
late to the party, so to speak, for the science of voice and speech had by the early 
1800s already shifted into the realm of physical acoustics. Robert Willis, a professor 
of mechanics at Cambridge University, was dismayed by both Kratzenstein’s and von 
Kempelen’s reliance on trial-and-error methods in building their talking machines, 
rather than acoustic theory. He took them to task, along with most others working 
in phonetics at the time, in his 1829 essay titled “On the Vowel Sounds, and on Reed 
Organ-Pipes.” The essay begins: 


The generality of writers who have treated on the vowel sounds appear 
never to have looked beyond the vocal organs for their origin. Apparently 
assuming the actual forms of these organs to be essential to their production, 
they have contented themselves with describing with minute precision the 
relative positions of the tongue, palate and teeth peculiar to each vowel, or 
with giving accurate measurements of the corresponding separation of the 
lips, and of the tongue and uvula, considering vowels in fact more in the light 
of physiological functions of the human body than as a branch of acoustics. 
(Willis, 1829, p. 231) 


Willis laid out a set of experiments in which he would investigate vowel produc- 
tion by deliberately neglecting the organs of speech. He built reed-driven organ pipes 
whose lengths could be increased or decreased with a telescopic mechanism, and then 
determined that an entire series of vowels could be generated with changes in tube 
length and reeds with different vibrational frequencies. Wheatstone (1837) later pointed 
out that Willis had essentially devised an acoustic system that, by altering tube length, 




B. Story, final draft 12.15.18 


and hence the frequencies of the tube resonances, allowed for selective enhancement of 
harmonic components of the vibrating reed. Wheatstone further noted that multiple 
resonances are exactly what is produced by the “cavity of the mouth,” and so the same 
effect occurs during speech production but with a nonuniformly shaped tube. 
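
Wheatstone's observation can be illustrated with the textbook idealization of a uniform tube that is closed at the reed end and open at the other, whose resonances fall at odd multiples of c/4L. The sketch below is only that idealization, not a reconstruction of Willis's apparatus: the speed of sound (350 m/s), the tube lengths, the 100 Hz reed frequency, and the function names are all assumptions chosen for illustration.

```python
def tube_resonances(length_m, n_resonances=3, speed_of_sound=350.0):
    """Resonances (Hz) of an ideal uniform tube closed at one end, open at the other."""
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m)
            for k in range(1, n_resonances + 1)]


def enhanced_harmonics(reed_hz, resonances_hz):
    """Harmonic number of the reed that lies nearest each tube resonance."""
    return [max(1, round(f / reed_hz)) for f in resonances_hz]


# Lengthening the tube lowers every resonance, so a different set of
# reed harmonics is reinforced (all values here are illustrative).
for length_cm in (10.0, 17.5, 25.0):
    resonances = tube_resonances(length_cm / 100.0)
    print(length_cm, [round(f) for f in resonances],
          enhanced_harmonics(100.0, resonances))
```

A nonuniform tube, such as the vocal tract, shifts these resonances away from the odd-multiple pattern, which is the point of Wheatstone's comparison to the "cavity of the mouth."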

Understanding speech as a pattern of spectral components became a major focus 
of acousticians studying speech communication for much of the 19th century and the 
very early part of the 20th century. As a result, developments of machines to produce 
speech sounds were also largely based on some form of spectral addition, with little or 
no reference to the human speech organs. For example, in 1859 the German scientist 
Hermann Helmholtz devised an electromagnetic system for maintaining the vibration 
of a set of eight or more tuning forks, each variably coupled to a resonating cham- 
ber to control amplitude (Helmholtz, 1859, 1875). With careful choice of frequencies 
and amplitude settings he demonstrated the artificial generation of five different vowels.
Rudolph Koenig, a well-known acoustical instrument maker in the 1800s, improved
on Helmholtz’s design and produced commercial versions that were sold to interested 
clients (Pantalony, 2004). Koenig was also a key figure in emerging technology that 
allowed for recording and visualization of sound waves. His invention of the phonautograph
with Edouard-Léon Scott in 1859 transformed sound via a receiving cone, diaphragm,
and stylus into a pressure waveform etched on smoked paper rotating about
a cylinder. A few years later he introduced an alternative instrument in which a flame
would flicker in response to a sound, and the movements of the flame were captured on a
rotating mirror, again producing a visualization of the sound as a waveform (Koenig, 
1873). 

These approaches were precursors to a device called the “phonodeik” that would
later be developed at the Case School of Applied Science by Dayton Miller (1909), who
eventually used it to study waveforms of sounds produced by musical instruments and 
human vowels. In a publication documenting several lectures given at the Lowell Insti- 
tute in 1914, Miller (1916) describes both the analysis of sound based on photographic 
representations of waveforms produced by the phonodeik, as well as intricate machines 
that could generate complex waveforms by adding together sinusoidal components and 
display the final product graphically so that it might be compared to those waveforms 
captured with the phonodeik. Miller referred to this latter process as harmonic synthe- 
sis, a term commonly used to refer to building complex waveforms from basic sinusoidal 
elements. It is, however, the first instance of the word “synthesis” in the present chapter. 
This was deliberate to remain true to the original references. Nowhere in the literature 
on Kratzenstein, von Kempelen, Wheatstone, Faber, Willis, or Helmholtz does “synthesis”
or “speech synthesis” appear. Their devices were variously referred to as talking
machines, automatons, or simply systems that generated artificial speech. Miller’s use of 
synthesis in relation to human vowels seems to have had the effect of labeling any future 
system that produces artificial speech, regardless of the theory on which it is based, a 
speech synthesizer. 

Interestingly, the waveform synthesis described by Miller was not actually synthesis 
of sound, but rather synthesis of graphical representations of waveforms. To produce 
synthetic sounds, Miller utilized a bank of organ pipes, each of which, by design, pos- 
sessed a different set of resonant frequencies. By controlling the amplitude of the sound 
produced by each pipe, he could effectively produce a set of nearly pure tones that were 
summed together as they radiated into free space. The composite waveform could then 
be captured with the phonodeik device and compared to the graphical synthesis of the 
same vowel. These were primarily vowel synthesizers, where production of each vowel 
required a different collection of pipes. There was little ability to dynamically change 
any aspect of the system except for interrupting the excitation of the pipes themselves; 
Miller did suggest such an approach to forming some basic words. 
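
The principle of harmonic synthesis, in either its graphical or its acoustic form, amounts to summing sinusoids at integer multiples of a fundamental frequency. The following is a minimal numerical sketch of that idea and not a model of Miller's machines or organ pipes; the sample rate, duration, amplitude values, and function name are assumptions introduced here.

```python
import math


def harmonic_synthesis(f0, amplitudes, sample_rate=8000, duration=0.02):
    """Sum sinusoids at integer multiples of f0.

    amplitudes[k] scales harmonic number k + 1; all phases are zero.
    """
    n_samples = int(round(sample_rate * duration))
    wave = []
    for i in range(n_samples):
        t = i / sample_rate
        wave.append(sum(a * math.sin(2.0 * math.pi * (k + 1) * f0 * t)
                        for k, a in enumerate(amplitudes)))
    return wave


# Two hypothetical amplitude patterns over the same fundamental give
# two different composite waveforms, and hence two different timbres.
pattern_a = harmonic_synthesis(100.0, [1.0, 0.5, 0.25])
pattern_b = harmonic_synthesis(100.0, [0.2, 0.3, 1.0])
```

Because each amplitude pattern is fixed, a sketch of this kind shares the limitation noted above: it yields a sustained sound and offers no means of moving from one vowel to another.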

At this point in time, about a decade and a half into the 20th century, the mechanical 
and electro-mechanical era of speech synthesis was coming to a close. The elaborate 
talking machines of von Kempelen and Faber that simulated human speech produc- 
tion were distant memories, having been more recently replaced by studies of vow- 
els using electro-mechanical devices that produced the spectral components of speech 
waveforms. Although there was much debate and disagreement about many details on 
the production of speech, primarily vowels, the ideas generated in this era were funda- 
mental to the development of phonetics. It had become firmly established by now (but 
not universally accepted) that the underlying acoustic principle of speech production 
was that resonances formed by a given configuration of an air cavity enhanced or ac- 
centuated the spectral components of a sound source (Rayleigh, 1878). The enhanced 
portions of the spectrum eventually came to be known as “formants,” a term that seems 
to have been first used by Ludimar Hermann in his studies of vowel production using 
phonograph technology (Hermann, 1894, 1895). Thus, the stage had been set to usher 
in the next era of speech synthesis. 


The electrical and electronic era 


A shift from using mechanical and electro-mechanical devices to generate artificial 
speech to purely electrical systems had its beginnings in 1922. It was then that John Q. 
Stewart, a young physicist from Princeton, published an article in the journal Nature
titled “An Electrical Analogue of the Vocal Organs” (Stewart, 1922). After military 
service in World War I, during which he was the chief instructor of “sound ranging” 
at the Army Engineering School, Stewart had spent two years as research engineer in 
the laboratories of the American Telephone and Telegraph Company and the West- 
ern Electric Company (Princeton Library). His article was a report of research he had 
completed during that time. In it he presents a diagram of a simple electrical circuit 
containing an “interrupter” or buzzer and two resonant branches composed of variable
resistors, capacitors, and inductors. Noting past research of Helmholtz, Miller, and
Scripture, Stewart commented that “it seems hitherto to have been overlooked that a 
functional copy of the vocal organs can be devised . . . [with] audio-frequency os- 
cillations in electrical circuits” (1922, p. 311). He demonstrated that a wide range of 
artificial vowels could be generated by adjusting the circuit elements in the resonant 
branches. Because of the ease and speed with which these adjustments could be made 
(e.g., turning knobs, moving sliders, etc.), Stewart also reported success in generating 
diphthongs by rapidly shifting the resonance frequencies from one vowel to another. 
Although the title of the article suggests otherwise, the circuit was not really an electri- 
cal analog of the vocal organs, but rather a means of emulating the acoustic resonances 
they produced. The design was essentially the first electrical formant synthesizer; in- 
terestingly, however, Stewart did not refer to his system as a synthesizer, but rather as 
an electrical analog of the vocal system. 
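
The arrangement Stewart described, a buzzer exciting two tunable resonant branches, can be approximated digitally with an impulse train feeding two second-order resonators whose outputs are added. This is a sketch under stated assumptions rather than a reconstruction of his circuit: the digital resonator is a stand-in for each analog branch, and the resonance frequencies, bandwidths, sample rate, and function names are illustrative values not given in the original article.

```python
import math


def resonator(signal, freq, bandwidth, sample_rate):
    """Two-pole digital resonator with unity gain at 0 Hz."""
    period = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth * period)
    b = (2.0 * math.exp(-math.pi * bandwidth * period)
         * math.cos(2.0 * math.pi * freq * period))
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def buzzer(f0, sample_rate, n_samples):
    """Impulse train standing in for the interrupter."""
    period = round(sample_rate / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def two_branch_vowel(f1, f2, f0=100.0, sample_rate=10000, n_samples=2000):
    """Excite two resonant branches with the buzzer and add their outputs."""
    source = buzzer(f0, sample_rate, n_samples)
    branch1 = resonator(source, f1, 80.0, sample_rate)
    branch2 = resonator(source, f2, 120.0, sample_rate)
    return [p + q for p, q in zip(branch1, branch2)]


# Hypothetical resonance settings for one vowel-like sound.
vowel = two_branch_vowel(700.0, 1100.0)
```

Changing `f1` and `f2` between calls corresponds to turning the knobs of the circuit; a diphthong would require the two values to change over time within a single utterance, which this sketch does not attempt.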

Stewart moved on to a long, productive career at Princeton as an astrophysicist and
did not further develop his speech synthesizer. He did, however, leave an insightful 
statement at the end of his article that foreshadowed the bane of developing artificial 
speech systems for decades to come, and still holds today. He noted that: 


The really difficult problem involved in the artificial production of speech 
sounds is not the making of the device which shall produce sounds which, in 
their fundamental physical basis, resemble those of speech, but in the manip- 
ulation of the apparatus to imitate the manifold variations in tone which are 
so important in securing naturalness. 

(Stewart, 1922, p. 312) 


Perhaps by “naturalness” it can be assumed he was referring to the goal of achieving 
natural human sound quality as well as intelligibility. In any case, he was clearly aware 
of the need to establish “rules” for constructing speech, and that simply building a device 
with the appropriate physical characteristics would not in itself advance artificial speech 
as a useful technology or tool for research. 

A few years later, in 1928, a communications engineer named Homer Dudley - also
working at the Western Electric Company (later to become Bell Telephone Laborato- 
ries) - envisioned a system that could be used to transmit speech across the transatlantic 
telegraph cable (Schroeder, 1981). Because it was designed for telegraph signals, how- 
ever, the cable had a limited bandwidth of only 100 Hz. In contrast, transmission of the 
spectral content of speech requires a minimum bandwidth of about 3000 Hz, and so the 
telegraph cable was clearly insufficient for carrying an electrical analog of the speech 
waveform. The bandwidth limitation, however, motivated Dudley to view speech pro- 
duction and radio transmission analogously. Just as the information content carried by 
a radio signal is embedded in the relatively slow modulation of a carrier wave, phonetic 
information produced by movements of the lips, tongue, jaw, and velum could be con- 
sidered to similarly modulate the sound wave produced by the voice source. That is, 
the speech articulators move at inaudible syllabic rates that are well below the 100 Hz 
bandwidth of the telegraph cable, whereas the voice source or carrier makes the signal 
audible but also creates the need for the much larger bandwidth. Understanding the 
difficulties of tracking actual articulatory movements, Dudley instead designed a circuit 
that could extract low frequency spectral information from an acoustic speech signal via 
a bank of filters, transmit that information along the low-bandwidth cable, and use it to 
modulate a locally supplied carrier signal on the receiving end to reconstruct the speech. 
This was the first analysis-synthesis system in which some set of parameters determined 
by analysis of the original signal could be sent to another location, or perhaps stored for 
later retrieval, and used to synthesize a new version of the original speech. Dudley had 
achieved almost exactly that which Joseph Henry had imagined in that letter he wrote 
long ago about linking together several of Faber’s talking machines to communicate 
across a long distance. 
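
The analysis-synthesis idea can be sketched in a few steps: split the incoming signal into frequency bands, reduce each band to a slowly varying envelope, keep only those envelopes as the transmitted parameters, and use them at the receiver to scale the same bands of a locally supplied carrier. The code below is an illustrative simplification and not Dudley's circuit. The band centers, bandwidths, envelope cutoff, frame step, and function names are assumptions, the band filters are crude two-pole resonators without gain normalization, and no pitch channel is included.

```python
import math


def band_filter(signal, center_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator used as a crude band filter (gain not normalized)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    b1 = 2.0 * r * math.cos(2.0 * math.pi * center_hz / sample_rate)
    b2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + b1 * y1 + b2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def envelope(signal, cutoff_hz, sample_rate):
    """Rectify, then smooth with a one-pole low-pass filter."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    level = 0.0
    out = []
    for x in signal:
        level += alpha * (abs(x) - level)
        out.append(level)
    return out


def analyze(speech, centers_hz, sample_rate, frame_step):
    """One envelope per band, sampled every frame_step samples."""
    frames = []
    for center in centers_hz:
        band = band_filter(speech, center, 100.0, sample_rate)
        frames.append(envelope(band, 25.0, sample_rate)[::frame_step])
    return frames


def synthesize(frames, carrier, centers_hz, sample_rate, frame_step):
    """Scale each band of a local carrier by the received envelope."""
    out = [0.0] * len(carrier)
    for center, env in zip(centers_hz, frames):
        band = band_filter(carrier, center, 100.0, sample_rate)
        for i, x in enumerate(band):
            out[i] += env[min(i // frame_step, len(env) - 1)] * x
    return out


# Toy input: a 500 Hz tone stands in for speech, a 100 Hz impulse
# train for the carrier (both hypothetical test signals).
sample_rate = 8000
centers = [250.0, 500.0, 1000.0, 2000.0]
tone = [math.sin(2.0 * math.pi * 500.0 * n / sample_rate) for n in range(4000)]
buzz = [1.0 if n % 80 == 0 else 0.0 for n in range(4000)]
frames = analyze(tone, centers, sample_rate, frame_step=160)
rebuilt = synthesize(frames, buzz, centers, sample_rate, frame_step=160)
```

In this toy case each band contributes one envelope value every 160 samples, that is, 50 values per second, which illustrates why the transmitted parameters need far less bandwidth than the waveform itself.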

Dudley’s invention became known as the VOCODER, an acronym derived from 
the two words VOice CODER (to avoid the repetition of capital letters and to reflect 
its addition to our lexicon, “Vocoder” will be used in the remainder of the chapter). 
The Vocoder was demonstrated publicly for the first time on September 11, 1936 at the 
Harvard Tercentenary Conference in Cambridge, Massachusetts (Dudley, 1936). During
an address given by F.B. Jewett, President of Bell Telephone Laboratories, Dudley was
called on to demonstrate the Vocoder to the audience (Jewett, 1936) and showed its 
capabilities for analysis and subsequent synthesis of speech and singing. Dudley could 
also already see the potential of using the Vocoder for entertainment purposes (Dudley, 
1939). He noted that once the low frequency spectral modulation envelopes had been 
obtained from speech or song, any signal with sufficiently wide bandwidth could be 
substituted as the carrier in the synthesis stage. For example, instrumental music or the
sound of a train locomotive could be modulated with the spectral-phonetic information
present in a sentence, producing a bizarre but entirely intelligible synthetic version of 
the original speech utterance (Dudley, 1940). Ironically, due to the international po- 
litical and military events of the late 1930s and early 1940s, the first major application 
of the Vocoder was not to amuse audiences, but rather to provide secure, scrambled 
speech signals between government and military officials during World War II, par- 
ticularly the conversations of Winston Churchill in London and Franklin D. Roosevelt 
in Washington, D.C. 

One of the difficulties that prevented wide acceptance of Vocoder technology for 
general telephone transmission was the problem of accurately extracting pitch (funda- 
mental frequency) from an incoming speech signal (Schroeder, 1993). Transmitting 
pitch variations along with the other modulation envelopes was essential for reconstructing
natural sounding speech. It was not, however, necessary for transmitting
intelligible speech, and hence the Vocoder could be acceptably used when the security of a
conversation was more important than the naturalness of the sound quality. Even so, both
Churchill and Roosevelt complained that the Vocoder made their speech sound silly 
(Tompkins, 2010), certainly an undesirable quality for world leaders. Eventually the 
pitch extraction problem was solved, other aspects were improved, and Vocoder tech- 
nology became a viable means of processing and compressing speech for telephone 
transmission. 

With the capability of isolating various aspects of speech, Dudley also envisioned 
the Vocoder as a tool for research in phonetics and speech science. In 1939, he and 
colleagues wrote, 


After one believes he has a good understanding of the physical nature of 
speech, there comes the acid test of whether he understands the construction 
of speech well enough to fashion it from suitably chosen elements. 

(Dudley et al., 1939a, p. 740) 


Perhaps Dudley realized, much as Stewart (1922) had warned, that building a device 
to decompose a speech signal and reconstruct it synthetically was relatively “easy” in 
comparison to understanding how the fundamental elements of speech, whatever form 
they may take, can actually be generated sequentially by a physical representation of 
the speech production system, and result in natural, intelligible speech. With this goal 
in mind, he and colleagues modified the Vocoder such that the speech analysis stage 
was replaced with manual controls consisting of a keyboard, wrist bar, and foot pedal 
(Dudley, Riesz, and Watkins, 1939). The foot pedal controlled the pitch of a relaxation
oscillator that provided a periodic voice source to be used for the voiced components 
of speech; a random noise source supplied the “electrical turbulence” needed for the 
unvoiced speech sounds. Each of the ten primary keys controlled the amplitude of the 
periodic or noise-like sources within a specific frequency band, which together spanned 
a range from 0 to 7500 Hz. By depressing combinations of keys and modulating the 
foot pedal, an operator of the device could learn to generate speech. 

This new synthetic speaker was called the “VODER” (or “Voder”), a new acronym
that comprised the capitalized letters in “Voice Operation DEmonstratoR” (Dudley, 
Riesz, and Watkins, 1939). In a publication of the Bell Laboratories Record (1939), the 
machine’s original moniker was “Pedro the Voder,” where the first name was a nod to 
Dom Pedro II, a former Emperor of Brazil who famously exclaimed “My God, it talks!” 
after witnessing a demonstration of Bell’s invention of the telephone in Philadelphia in 
1876. The Bell publication (“Pedro the Voder,” 1939) pointed out that the telephone 
did not actually talk, but rather transmitted talk over distance. In contrast, the Voder 
did talk and was demonstrated with some fanfare at the 1939 World’s Fair in New York 
and at the Golden Gate Exposition in San Francisco the same year. It is interesting that 
this publication also states “It is the first machine in the world to do this [i.e., talk]” 
(Bell Labs Pubs, 1939, p. 170). If this was a reference to synthetic speech produced by 
an electronic artificial talker, it is likely correct. But clearly Joseph Faber had achieved 
the same goal by mechanical means almost a century earlier. In fact, the description 
of the Voder on the same page as a “little old-fashioned organ with a small keyboard 
and a pedal” could have easily been used to describe Faber’s machine. In many ways, 
Dudley and colleagues at Bell Labs were cycling back through history with a new form 
of technology that would now allow for insights into the construction of speech that 
the machines of previous eras would not reveal to their makers. 

One of the more interesting aspects of the Voder development, at least from the per- 
spective of phonetics, was how people learned to speak with it. Stanley S.A. Watkins, 
the third author on the Dudley, Riesz, and Watkins (1939) article describing the Voder 
design, was charged with prescribing a training program for a group of people who 
would become “operators.” He first studied the ways in which speech sounds were characterized
across the ten filter bands (or channels) of the Voder. Although this was found
to be useful information regarding speech, it was simply too complex to be useful in 
deriving a technique for talking with the Voder. Various other methods of training 
were attempted, including templates to guide the fingers and various visual indicators, 
but eventually it was determined that the most productive method was for the operator
to search for a desired speech sound by “playing” with the controls as guided by
their ear. Twenty-four people, drawn from telephone operator pools, were trained 
to operate the Voder for the exhibitions at both sites of the 1939 World’s Fair. Typi- 
cally, about one year was required to develop the ability to produce intelligible speech 
with it. In fact, Dudley et al. wrote “the first half [of the year of training was] spent
in acquiring the ability to form any and all sounds, the second half being devoted to 
improving naturalness and intelligibility” (Dudley, Riesz, and Watkins, 1939, p. 763). 
Once learned, the ability to “speak” with the Voder was apparently retained for years 
afterward, even without continued practice. On the occasion of Homer Dudley’s re- 
tirement in 1961, one of the original trained operators was invited back to Bell Labs for 
an “encore performance” with a restored version of the talking machine. As recalled 
by James Flanagan, a Bell Labs engineer and speech scientist, “She sat down and gave a 
virtuoso performance on the Voder” (Pieraccini, 2012, p. 55). 

In his article “The Carrier Nature of Speech,” Dudley (1940) made a compelling 
analogy of the Voder structure to the human speech production system. But the Voder 
was really a spectrum shaping synthesizer: The cutoff frequencies and bandwidths of 
the ten filters associated with the keyboard were stationary, and so control was imposed 
by allowing the key presses to modulate the signal amplitude within each filter band. 
In effect, this provided the operator a means of continuously enhancing or suppressing
the ten discrete divisions of the spectrum in some selective pattern such that an
approximation of time-varying formants was generated. It can be noted that Faber’s
mechanical talking machine from a century earlier presented an operator with essen- 
tially the same type of interface as the Voder (i.e., keyboard, foot pedal), but it was 
the shape of cavities analogous to the human vocal tract that was controlled rather
than the speech spectrum itself. In either case, and like a human acquiring the ability 
to speak, the operators of the devices learned and internalized a set of rules for gener- 
ating speech by modulating a relatively high-frequency carrier signal (i.e., vocal fold 
vibration, turbulence) with slowly varying, and otherwise inaudible, “message waves” 
(Dudley, 1940). Although the ability of a human operator to acquire such rules is 
highly desirable for performance-driven artificial speech, it would eventually become 
a major goal for researchers in speech synthesis to explicate such rules in an attempt 
to understand phonology and motor planning in speech production, as well as to de- 
velop algorithms for transforming symbolic phonetic representations into speech via 
synthetic methods. 
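
The spectrum-shaping principle described above can be illustrated with a minimal sketch. It is not a model of the Voder circuitry: the equal-width band edges, the 100 Hz buzz source, and the names `BAND_EDGES_HZ` and `shape_spectrum` are assumptions made only for illustration. The sketch keeps ten band edges fixed and lets ten “key” gains scale whatever source harmonics fall inside each band.

```python
# Hypothetical illustration of spectrum shaping with ten fixed bands.
# The band edges are NOT the Voder's actual filter bands; equal 750 Hz widths are assumed.
BAND_EDGES_HZ = [i * 750 for i in range(11)]  # 0, 750, ..., 7500 Hz


def shape_spectrum(harmonics, key_gains):
    """Scale each (frequency, amplitude) pair by the gain of the band it falls in.

    harmonics: list of (freq_hz, amplitude) pairs describing the source.
    key_gains: ten numbers in [0, 1], one per band (how far each key is pressed).
    """
    if len(key_gains) != len(BAND_EDGES_HZ) - 1:
        raise ValueError("expected one gain per band")
    shaped = []
    for freq, amp in harmonics:
        gain = 0.0  # components outside every band are removed
        for band in range(len(key_gains)):
            if BAND_EDGES_HZ[band] <= freq < BAND_EDGES_HZ[band + 1]:
                gain = key_gains[band]
                break
        shaped.append((freq, amp * gain))
    return shaped


# A flat "buzz" source: harmonics of an assumed 100 Hz fundamental.
buzz = [(100.0 * k, 1.0) for k in range(1, 75)]
# First and third keys pressed, all others released.
gains = [1.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
shaped = shape_spectrum(buzz, gains)
```

Changing the entries of `gains` from one time step to the next would correspond, loosely, to the operator moving fingers on the keyboard; the band frequencies themselves never change.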

The research at Bell Labs that contributed to the Vocoder and Voder occurred in 
parallel with development of the “sound spectrograph” (Potter, 1945), a device that 
could graphically represent the time-varying record of the spectrum of a sound rather
than the waveform. The output of the device, called a “spectrogram,” was arranged 
such that time and frequency were on the x-axis and y-axis, respectively, and intensity 
was coded by varying shades of gray. It could be set to display either the narrowband 
harmonic structure of a sound, or the wideband formant patterns. Although develop- 
ment of the spectrograph had been initiated by Ralph Potter and colleagues just prior 
to the United States’ involvement in World War II, it was given “official rating as a war 
project” (Potter, 1945, p. 463) because of its potential to facilitate military communica- 
tions and message decoding. During the war, the spectrograph design was refined and 
used extensively to study the temporo-spectral patterns of speech based on the “spectro- 
grams” that it generated. It wasn’t until several months after the war ended, however, 
that the existence of the spectrograph was disclosed to the public. On November 9, 
1945, Potter published an article in Science titled “Visible Patterns of Sound” in which 
he gave a brief description of the device and explained its potential application as a tool 
for studying phonetics, philology, and music. He also suggested its use as an aid for per- 
sons who are hearing impaired; the idea was that transforming speech from the auditory 
to the visual domain would allow a trained user to “read” speech. Other publications re- 
garding the spectrograph soon followed with more detailed descriptions concerning its 
design (Koenig, Dunn, and Lacy, 1946; Koenig and Ruppel, 1948) and use (Kopp and 
Green, 1946; Steinberg and French, 1946; Potter and Peterson, 1948; Potter, 1949). 

Just as instrumentation that allowed researchers to see acoustic speech waveforms 
had motivated earlier methods of synthesis (e.g., Miller, 1916), the spectrographic vi- 
sualization of speech would rapidly inspire new ways of synthesizing speech, and new 
reasons for doing so. Following World War II, Frank Cooper and Alvin Liberman, 
researchers at Haskins Laboratories in New York City, had begun extensive analyses 
of speech using a spectrograph based on the Bell Labs design. Their goals, which were 
initially concerned with building a reading machine for the blind, had been diverted 
to investigations of the acoustic structure of speech, and how it was perceived and
decoded by listeners. They realized quickly, however, that many of their questions 
could not be answered simply by inspection of spectrograms. What was needed was 
a means of modifying some aspect of the visual representation of speech provided by 
the spectrogram, and transforming it back into sound so that it could be presented to a 
listener as a stimulus. The responses to the stimuli would indicate whether or not the 
spectral modification was perceptually relevant. 

In 1951, Cooper, Liberman, and Borst reported on the design of a device that would 
allow the user to literally draw a spectrographic representation of a speech utterance on
a film transparency and transform it into a sound wave via a system including a light 
source, tone wheel, photocell, and amplifier. The tone wheel contained 50 circular 
sound tracks that, when turned by a motor at 1800 rpm, would modulate light to gen- 
erate harmonic frequencies from 120-6000 Hz, roughly covering the speech spectrum. 
The photocell would receive only the portions of spectrum corresponding to the pattern 
that had been drawn on the film, and convert them to an electrical signal which could 
be amplified and played through a loudspeaker. The “drawn” spectrographic pattern 
could be either a copy or modification of an actual spectrogram, and hence the device 
came to be known as the “Pattern Playback.” It was used to generate stimuli for nu- 
merous experiments on speech perception and contributed greatly to knowledge and 
theoretical views on how speech is decoded (cf., Liberman, Delattre, and Cooper, 1952, 
1954, 1957; Liberman et al., 1957; Harris et al., 1958; Liberman, Delattre, and Cooper, 
1967). The Pattern Playback was the first speech synthesizer used for large-scale sys- 
tematic experimentation concerning the structure of speech, and proved to be most 
useful for investigations concerning isolated acoustic cues such as formant transitions at 
the onset and offset of consonants (Delattre et al., 1952; Borst, 1956). 
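
The tone-wheel figures quoted above can be checked with a short calculation. This is a sketch under stated assumptions, not a description of the actual hardware: the text gives only the motor speed, the number of tracks, and the 120-6000 Hz range, so the value of four light/dark cycles on the innermost track is inferred from those numbers, and the function names are hypothetical.

```python
import math

ROTATION_RPM = 1800   # motor speed given in the text
NUM_TRACKS = 50       # circular sound tracks on the tone wheel


def track_frequencies(rpm=ROTATION_RPM, num_tracks=NUM_TRACKS, cycles_on_first_track=4):
    """Return the frequency in Hz produced by each track of the wheel.

    Assumption: track k carries k * cycles_on_first_track light/dark cycles
    per revolution, so the tracks form a harmonic series.
    """
    revolutions_per_second = rpm / 60.0
    fundamental = cycles_on_first_track * revolutions_per_second
    return [k * fundamental for k in range(1, num_tracks + 1)]


def playback_sample(painted_tracks, t, freqs):
    """Sum unit-amplitude sinusoids for the tracks 'painted' at time t (seconds).

    painted_tracks: zero-based indices of the harmonics let through by the drawing.
    """
    return sum(math.sin(2 * math.pi * freqs[k] * t) for k in painted_tracks)


freqs = track_frequencies()
# Hypothetical drawing: only harmonics 4, 5, and 12 are painted at this instant.
sample = playback_sample([3, 4, 11], t=0.001, freqs=freqs)
```

With 30 revolutions per second, four cycles per revolution gives the 120 Hz fundamental and the fiftieth track gives 6000 Hz, which matches the range stated in the text.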

The usefulness of speech synthesizers as research tools was summarized in a review 
article by Cooper (1961) in which he wrote: 


The essential point here, as in all of science, is that we must simplify Na- 
ture if we are to understand her. More than that: we must somehow choose 
a particular set of simplifying assumptions from the many sets that are possi- 
ble. The great virtue of speech synthesizers is that they can help us make this 
choice. 

(Cooper, 1961, p. 4) 


The Pattern Playback served the purpose of “simplifying Nature” by making the 
spectrotemporal characteristics of speech accessible and manipulable to the researcher.
When used in this manner, a speech synthesizer becomes an experimenter’s “versatile 
informant” that allows for testing hypotheses about the significance of various spectral 
features (Cooper, 1961, pp. 4-5). One of the advantages of the Pattern Playback was that
virtually anything could be drawn (or painted) on the film transparency regardless of the 
complexity or simplicity, and it could be heard. For example, all of the detail observed 
for a speech utterance in a spectrogram could be reconstructed, or something as simple 
as a sinusoid could be drawn as a straight line over time. The disadvantage was that 
the only means of generating an utterance, regardless of the accuracy of the prescribed
rules, was for someone to actually draw it by hand.

The users of the Pattern Playback became quite good at drawing spectrographic
patterns that generated intelligible speech, even when they had not previously seen an 
actual spectrogram of the utterance to be synthesized (Delattre et al., 1952; Liberman et 
al., 1959). Much like the operators of the speaking machines that preceded them, they 
had, through practice, acquired or internalized a set of rules for generating speech. 
Delattre et al. (1952) did attempt to characterize some speech sounds with regard to 
how they might be drawn spectrographically, but it was Frances Ingemann who for- 
malized rules for generating utterances with the Pattern Playback (Ingemann, 1957; 
Liberman et al., 1959). The rules were laid out according to place, manner, and voicing,
and could presumably be used by a novice to draw and generate a given utterance.
Although the process would have been extremely time consuming and tedious, Mat- 
tingly (1974) notes that this was the first time that explicit rules for generating speech 
with a synthesizer had been formally documented. 

Other types of synthesizers were also developed during this period that facilitated 
production of artificial speech based on acoustic characteristics observed in a spectro- 
gram, but were based on different principles than the Pattern Playback. In 1953, Wal- 
ter Lawrence, a researcher for the Signals Research and Development Establishment in 
Christchurch, England, introduced a speech synthesizer whose design consisted of an 
electrical circuit with a source function generator and three parallel resonant branches. 
The frequency of each resonance could be controlled by the user, as could the frequency
and amplitude of the source function. Together, the source and resonant branches pro- 
duced a waveform with a time-varying spectrum that could be compared to a spectro- 
gram, or modified for purposes of determining the perceptual relevance of an acoustic 
cue. Because the parameters of the circuit (i.e., resonance frequencies, source fun- 
damental frequency, etc.) were under direct control, Lawrence’s synthesizer became 
known as the “Parametric Artificial Talker” or “PAT” for short. PAT was used by Pe- 
ter Ladefoged and David Broadbent to provide acoustic stimuli for their well-known 
study of the effects of acoustic context on vowel perception (Ladefoged and Broadbent, 
1957). 

At about the same time, Gunnar Fant was also experimenting with resonant circuits 
for speech synthesis at the Royal Institute of Technology (KTH) in Stockholm. In- 
stead of placing electrical resonators in parallel as Lawrence did in building PAT, Fant 
configured them in a series or “cascade” arrangement. Fant’s first cascade synthesizer, 
called “OVE I,” an acronym based on the words “Orator Verbis Electris,” was primarily 
a vowel synthesizer that had the unique feature of a mechanical stylus that could be 
moved in a two-dimensional plane for control of the lowest two resonance frequencies. 
A user could then generate speech (vowels and vowel transitions) by moving the stylus
in the vowel space defined by the first two formant frequencies, a system that may
have had great value for teaching and learning the phonetics of vowels. It may have 
had some entertainment value as well. Fant (2005) reminisced that one of the three 
“opponents” at his doctoral dissertation defense in 1958 was, in fact, Walter Lawrence 
who had brought his PAT synthesizer with him to Stockholm. At one point during 
the defense proceedings Fant and Lawrence demonstrated a synthesizer dialogue be- 
tween PAT and OVE I. Eventually, Fant developed a second version of the cascade-type 
synthesizer called “OVE II” (Fant and Martony, 1962). The main enhancements were 
additional subsystems to allow for production of nasals, stops, and fricatives, as well as a 
conductive ink device for providing time-varying parameter values to the synthesizer. 

The development of PAT and OVE set the stage for a category of artificial speech 
that would eventually be referred to as formant synthesis, because they provided for 
essentially direct control of the formants observed in a spectrogram. In a strict sense, 
however, they are resonance synthesizers because the parameters control, among other 
things, the frequencies of the electrical (or later, digital) resonators themselves. In most 
cases, though, these frequencies are aligned with the center frequency of a formant, and 
hence resonance frequency and formant frequency become synonymous. Although it 
may seem like a minor technological detail, the question of whether such synthesiz- 
ers should be designed with parallel or cascaded resonators would be debated for years 
to come. A parallel system offers the user the greatest amount of control over the spectrum
because both the resonator frequencies and amplitudes are set with parameters.
In contrast, in a cascade system the resonance frequencies are set by a user, while their 
amplitudes are an effect of the superposition of multiple resonances, much as is the case 
for the human vocal tract (Flanagan, 1957). Thus, the cascade approach could po- 
tentially produce more natural sounding speech, but with somewhat of a sacrifice in 
control. Eventually, Lawrence reconfigured PAT with a cascade arrangement of res- 
onators, but after many years of experimentation with both parallel and cascade systems, 
John Holmes of the Joint Speech Research Unit in the U.K. later made a strong case 
for a parallel arrangement (Holmes, 1983). He noted that replication of natural speech 
is considerably more accurate with user control of both formant frequencies and their 
amplitudes. 
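
The difference between the two arrangements can be sketched numerically. The following is a minimal illustration assuming idealized two-pole resonators with unity gain at 0 Hz; it does not reproduce the circuits of PAT, OVE, or any later synthesizer, and the formant values, bandwidths, branch amplitudes, and function names are all hypothetical. Practical parallel designs often include additional details, such as alternating branch polarity, that are omitted here.

```python
import math


def resonator(f, center_hz, bandwidth_hz):
    """Complex response at frequency f (Hz) of an idealized two-pole resonator."""
    s = 2j * math.pi * f
    pole = complex(-math.pi * bandwidth_hz, 2 * math.pi * center_hz)
    numerator = pole * pole.conjugate()  # normalizes the gain to 1 at f = 0
    return numerator / ((s - pole) * (s - pole.conjugate()))


def cascade_response(f, formants):
    """Magnitude when resonators are chained: the responses multiply.

    Only frequencies and bandwidths are specified; the relative peak
    heights follow from the product.
    """
    h = 1 + 0j
    for center_hz, bandwidth_hz in formants:
        h *= resonator(f, center_hz, bandwidth_hz)
    return abs(h)


def parallel_response(f, formants, amplitudes):
    """Magnitude when resonators are summed: each branch has its own gain."""
    total = 0j
    for (center_hz, bandwidth_hz), amp in zip(formants, amplitudes):
        total += amp * resonator(f, center_hz, bandwidth_hz)
    return abs(total)


# Hypothetical vowel-like settings: (formant frequency, bandwidth) in Hz.
formants = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 120.0)]
cascade_at_f2 = cascade_response(1500.0, formants)
parallel_with_f2 = parallel_response(1500.0, formants, [1.0, 1.0, 1.0])
parallel_without_f2 = parallel_response(1500.0, formants, [1.0, 0.0, 1.0])
```

In the cascade function there is no amplitude argument at all, whereas in the parallel function the second formant peak can be raised, lowered, or removed by changing one entry of `amplitudes`. This mirrors the trade-off between naturalness and control described above.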

Developed concurrently with formant synthesizers in the 1950s was another
type of synthesis approach, also based on electrical circuits, but intended to serve
as a model of the shape of the vocal tract so that the relation of articulatory configura- 
tion to sound production could be more effectively studied. The first of this type was 
described in 1950 by H.K. Dunn, another Bell Labs engineer. Instead of building resonant
circuits to replicate formants, Dunn (1950) designed an electrical transmission line
in which consecutive (and coupled) “T-sections,” made up of capacitors, inductors, and 
resistors, were used as analogs of the pharyngeal and oral air cavities within the vocal 
tract. The values of the circuit elements within each T-section were directly related to 
the cross-sectional area and length of the various cavities, and thus the user now had 
parametric control of the vocal tract shape. Although this was an advance, the vocal 
tract configurations that could be effectively simulated with Dunn’s circuit were fairly 
crude representations of the human system. 
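
The relation between tube geometry and circuit element values can be sketched with the standard lossless-tube analogy, in which a short section of area A and length l has an acoustic inertance of ρl/A (the analog of inductance) and an acoustic compliance of Al/(ρc²) (the analog of capacitance). The sketch below is an illustration under that textbook assumption only. It ignores the resistive elements, it does not reproduce Dunn's component values, and the constants and function name are chosen for illustration.

```python
import math

# Assumed typical values in CGS units (not taken from the source text).
RHO = 1.14e-3      # density of air, g/cm^3
C_SOUND = 3.5e4    # speed of sound, cm/s


def t_section(area_cm2, length_cm):
    """Return (inertance, compliance) for one lossless tube section.

    In a T-section the inertance would be split into two series halves
    with the compliance as the shunt element between them.
    """
    inertance = RHO * length_cm / area_cm2
    compliance = area_cm2 * length_cm / (RHO * C_SOUND ** 2)
    return inertance, compliance


# Hypothetical section: 2 cm^2 in area and 0.5 cm long.
inertance, compliance = t_section(2.0, 0.5)
characteristic_impedance = math.sqrt(inertance / compliance)  # equals RHO * C_SOUND / area
section_delay = math.sqrt(inertance * compliance)             # equals length / C_SOUND
```

Narrowing a section raises its inertance and lowers its compliance, which is how a change in vocal tract shape translates into a change in circuit settings.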

Stevens, Kasowski, and Fant (1953), in their article “An Electrical Analog of the 
Vocal Tract,” describe a variation on Dunn’s design using a similar transmission line 
approach; however, they were able to represent the vocal tract shape as a concatenation 
of 35 cylindrical sections, where each section was 0.5 cm in length. Their purpose in pursuing
a more detailed representation of the vocal tract shape was to be able to “study in
detail the mechanism of speech production and to investigate correlations between the
acoustic and articulatory aspects of speech”; they noted that “a speech synthesizer would
be required to simulate more closely the actual dimensions of the vocal tract” (p. 735).
Fant also began work on his own version of a Line Electrical Analog (LEA) that he used 
for studies of speech sounds. Both Stevens and House (1955) and Fant (1960) used these 
very similar synthesizers to better understand vowel articulation by first developing a 
three parameter model of the vocal tract shape in which the location and radius of the 
primary vowel constriction were specified, along with the ratio of lip termination area 
to lip tube length. Their synthesizers allowed for a systematic exploration of the para- 
metric space and resulted in nomographic displays that demonstrated the importance 
of the location (place) and cross-sectional area of the primary vocal tract constriction in 
vowels. Collectively, this work significantly altered the view of vowel production. 

A limitation of both the Stevens, Kasowski, and Fant (1953) and Fant (1960) line ana- 
log synthesizers was that they could not generate time-varying speech sounds because 
they accommodated only static vocal tract configurations; i.e., they couldn’t actually 
talk. Using a more complex line analog circuit system and a bank of switches, George 
Rosen, a doctoral student at the Massachusetts Institute of Technology, devised a means 
of transitioning from one vocal tract configuration to another (Rosen, 1958). This new 
system, known as “DAVO” for “dynamic analog of the vocal tract,” could generate fairly 
clear diphthongs and consonant-vowel (CV) syllables, but was not capable of sentence- 
level speech. 

It can be noted that the parametric models of the vocal tract shape developed by 
Stevens and House (1955) and Fant (1960) were independent of the transmission line
analogs that were used to produce the actual synthetic speech. The limitation of only 
static vowels, or CVs in the case of DAVO, was entirely due to the need for a complicated 
electrical circuit to simulate the propagation of acoustic waves in the vocal tract. The 
vocal tract models themselves could have easily been used to generate time-dependent 
configurations over the time course of a phrase or sentence, but a system for producing 
the corresponding synthetic speech waveform with such temporal variation simply did 
not yet exist, nor did the knowledge of how to specify the time-dependence of the 
vocal tract parameters. 

Yet another type of speech synthesizer was also under development during the 
1950s. With significant improvements in the state of audio recording technology, 
particularly those related to storing speech waveforms on magnetic tape, it was now 
possible to consider synthesis - perhaps in the purest sense of the word - based on splic- 
ing together small segments of prerecorded natural speech. Harris (1953a) designed a 
system in which segments of tape were isolated that contained many instances (allo- 
phones) of each consonant and vowel. Then, with a recording drum, tape loop, and 
timing and selector circuits (Harris, 1953b), synthetic speech could be generated by 
piecing together a sequence of segments deemed to match well with regard to formant 
frequencies and harmonics. The speech produced was found to be fairly intelligible 
but quite unnatural, presumably because of the discontinuities created at the segment 
boundaries. 

Rather than focusing on vowel and consonant segments, Peterson, Wang, and Sivert- 
sen (1958) experimented with alternative segmentation techniques and determined that 
a more useful unit for synthesizing speech could be obtained from segments extending 
in time from the steady-state location of one phoneme to the next. Referring to this 
unit as a “dyad,” they suggested that it preserved the acoustic dynamics of the transi- 
tions between phonemes, precisely the information lost when the segmentation unit 
is the phoneme itself. The potential of this method to generate intelligible speech was 
demonstrated by Wang and Peterson (1958) where they constructed a sentence from 
more than 40 dyad segments extracted from previously recorded utterances. Much 
care was required in selecting the segments, however, in order to maintain continuity 
of pitch, intensity, tempo, and vocal quality. The range of phonetic characteristics that 
can be generated in synthetic speech by concatenating segments is, of course, limited 
by the segment inventory that is available. Sivertsen (1961) conducted an extensive 
study of the size of inventory needed relative to the type of segmentation unit chosen. 
She considered various segments with a wide range of temporal extent that included 
phonemes, phoneme dyads, syllable nuclei, half syllables, syllables, syllable dyads, and
words, and found that, in general, “the size of the inventory increases with the length 
of the segment” (Sivertsen, 1961, p. 57). That is, a few small units can be combined 
in many ways to generate hundreds or thousands of increasingly larger units, but if 
the starting point is a large temporal unit, an enormous number is needed because the 
possibilities for recombining them are severely limited. 
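
The growth of the inventory with unit length can be illustrated with a rough upper bound. This is only a counting sketch, not Sivertsen's analysis: it assumes a hypothetical language with 40 phonemes, treats a unit simply as a sequence of a given number of phonemes, and ignores phonotactic constraints, which would make the real counts smaller.

```python
def max_inventory(num_phonemes, phonemes_per_unit):
    """Upper bound on distinct units that span a given number of phonemes."""
    return num_phonemes ** phonemes_per_unit


NUM_PHONEMES = 40  # hypothetical inventory size, chosen for illustration

single_phonemes = max_inventory(NUM_PHONEMES, 1)   # phoneme-sized units
dyads = max_inventory(NUM_PHONEMES, 2)             # units spanning two phonemes
three_phoneme_units = max_inventory(NUM_PHONEMES, 3)
```

Under these assumptions the bound rises from 40 to 1,600 to 64,000 as the unit spans one, two, and three phonemes, which is the trend described in the quoted passage.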

This approach to synthesis clearly has played a major role in technological appli- 
cations such as modern text-to-speech systems utilizing unit selection techniques (cf., 
Moulines and Charpentier, 1990; Sagisaka et al., 1992; Hunt and Black, 1996), but 
Sivertsen (1961) also made a strong case for the use of segment concatenation methods
as a research tool. In particular, she noted that stored segments of various
lengths can be used for evaluating some theories of linguistic structure, as well as for
investigating segmentation of speech signals in general. In fact, Sivertsen suggested 
that essentially all speech synthesis methods could be categorized relative to how the 
speech continuum is segmented. If the segmentation is conceived as “simultaneous 
components” then speech can be synthesized by controlling various parametric repre- 
sentations “independently and simultaneously.” These may be physiological parame- 
ters such as vocal tract shape, location and degree of constriction, nasal coupling, and 
laryngeal activity, or acoustical parameters such as formant frequencies, formant band- 
widths, fundamental frequency, voice spectrum, and amplitude. If, instead, the speech 
continuum is segmented in time, synthetic speech can be produced by sequencing
successive “building blocks,” which may be extracted from recorded natural speech or 
even generated electronically. 

The advent of digital computing in the early 1960s would dramatically change the 
implementation of speech synthesizers, and the means by which they may be con- 
trolled. The underlying principles of the various synthesis methods, however, are often 
the same or at least similar to those that motivated development of mechanical, electrical,
or electronic talking devices. Thus, delineation of synthesizer type relative to the
segmentation of the speech continuum is particularly useful for understanding the dif- 
ferences and potential uses of the wide range of synthetic speech systems that had so far 
been advanced at the time, and also for those yet to be developed. 


The digital and computational era 


As Stewart (1922) had suggested in the early days of electrical circuit-based syn- 
thesis, building a device capable of producing sounds that resemble speech is far less 
difficult than knowing how to impose the proper control on its parameters to make the 
device actually talk. Although much progress had been made in development of various
types of systems, controlling electronic speech synthesizers by manipulating circuit
parameters, whether they were vocal tract or terminal analog types, was cumbersome 
and tedious. This could now be mitigated to some degree, however, by engaging a 
digital computer to control speech synthesizers that were, themselves, still realized as 
electronic circuits. That is, commands typed on a keyboard could be transformed into 
control voltages that imposed parameter changes in the synthesis circuitry. In effect, 
this allowed the “computational load” for generating the speech waveform to remain 
in the analog circuit, while transferring the control of the system to a user via a com- 
puter interface. It would not be long, however, before the hardware synthesizers were 
replaced with software realizations of the same circuit elements, offering far greater 
flexibility and ease with which synthetic speech could be generated. 

Digital control facilitated development of “speech synthesis by rule” in which an 
orthographic representation of an utterance could be transformed into artificial speech. 
Based on a set of “rules” embedded in a computer program, a series of symbols repre- 
senting the phonetic elements of a word or phrase were converted to temporal variations 
of the parameters of a specific type of synthesizer. For example, Holmes, Mattingly, and 
Shearme (1964) described the rules and associated computer program that calculated 
the time course of the parameters of a parallel resonance (formant) synthesizer. These 
included, among other variables, frequencies and amplitudes of three resonances, and 
fundamental frequency. A word such as “you” (/ju/) might be produced with a sim- 
ple interpolation of the second formant frequency, F2, from a high value, say 2200 
Hz, down to a much lower value, perhaps 400 Hz, while other parameters could be 
held constant. The interpolated F2 would then be used to alter the settings of circuit 
elements over a particular period of time, resulting in a speech waveform resembling “you.”
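
The kind of rule described in this example can be sketched as a simple linear interpolation of one parameter. The start and end values of F2 come from the text; the 200 ms duration, the 10 ms update interval, and the function name are assumptions made for illustration and are not taken from Holmes, Mattingly, and Shearme (1964).

```python
def interpolate_parameter(start, end, duration_ms, step_ms=10):
    """Return (time_ms, value) pairs moving linearly from start to end."""
    steps = int(duration_ms // step_ms)
    if steps < 1:
        raise ValueError("duration must cover at least one step")
    return [(i * step_ms, start + (end - start) * i / steps) for i in range(steps + 1)]


# F2 glide for a word like "you": 2200 Hz down to 400 Hz over an assumed 200 ms.
f2_track = interpolate_parameter(2200.0, 400.0, duration_ms=200)
```

Each pair in `f2_track` would correspond to one update of the second resonance setting, with the remaining parameters held constant.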


A similar goal of producing “synthetic speech from an input consisting only of the 
names of phonemes and a minimum of pitch and timing information” was pursued by 
Kelly and Lochbaum (1962, p. 1), but in their system a digital lattice filter, entirely 
realized as a computer algorithm, was used to calculate the effective propagation of 
acoustic waves in an analog of the vocal tract configuration. Control of the system 
required specification of 21 time-varying cross-sectional areas representing the vocal 
tract shape along the axis extending from the glottis to lips, as well as nasal coupling, 
fundamental frequency, aspiration, and affrication. Each phoneme was assigned a vocal 
tract shape (i.e., 21 cross-sectional areas) read from a lookup table; change in tract shape
was accomplished by linearly interpolating, over time, each cross-sectional area from the value
specified for one phoneme to that of the next phoneme. This design functioned essentially
as a digital version of a synthesizer like George Rosen’s DAVO; but because it was soft- 
ware rather than hardware, it allowed for more precise specification of the vocal tract 
shape and almost endless possibilities for experimentation with interpolation types and 
associated rules. 
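
Two of the steps described above, interpolating between stored area functions and relating adjacent areas to wave reflection, can be sketched as follows. This is a minimal illustration and not Kelly and Lochbaum's program: the two 21-section area functions are invented, the function names are hypothetical, and the sign convention chosen for the reflection coefficient is one of several in use.

```python
def interpolate_areas(areas_from, areas_to, fraction):
    """Linearly interpolate every cross-sectional area; fraction runs from 0 to 1."""
    if len(areas_from) != len(areas_to):
        raise ValueError("area functions must have the same number of sections")
    return [a + (b - a) * fraction for a, b in zip(areas_from, areas_to)]


def reflection_coefficients(areas):
    """Reflection coefficient at each junction between adjacent tube sections.

    Uses r = (A_k - A_next) / (A_k + A_next); other sign conventions exist.
    """
    return [(a - b) / (a + b) for a, b in zip(areas[:-1], areas[1:])]


# Invented 21-section area functions (cm^2), glottis to lips.
uniform_tube = [3.0] * 21
constricted_tube = [3.0] * 10 + [0.5] * 3 + [3.0] * 8

halfway = interpolate_areas(uniform_tube, constricted_tube, 0.5)
junctions = reflection_coefficients(constricted_tube)
```

A uniform tube gives zero reflection at every junction, while the constriction produces a positive coefficient where the tube narrows and a negative one where it widens again. Replacing the linear interpolation with another function of `fraction` is the kind of experimentation that a software realization makes easy.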

Kelly and Lochbaum expressed disappointment in the sound quality of the speech 
generated by their system, but attributed the problem to inadequate knowledge of the 
cross-sectional areas that were used as the vocal tract shapes corresponding to phoneme 
targets. Although these areas were based on Fant’s (1960) well-known collection of vocal
tract data obtained from X-ray images, it would not be until the 1990s that imaging methods
would allow for three-dimensional reconstructions of vocal tract shapes produced
by human talkers (cf., Baer et al., 1991; Story, Titze, and Hoffman, 1996, 1998), and
hence improve this aspect of analog vocal tract synthesis. The time-varying spatial
characteristics of a linearly interpolated vocal tract shape were, however, also potential 
contributors to the undesirable quality of the synthesis. A more complex set of rules for 
control of a vocal tract analog was described a few years later by Nakata and Mitsuoka 
(1965), and resulted in intelligible speech with “fairly good naturalness.” Nonetheless, 
they, along with others, believed that significant improvements in vocal tract analog
synthesis required more detailed knowledge of realistic articulatory movement from 
which better timing rules could be derived. 

By the 1960s, X-ray cineradiography technology had developed to a point where 
the spatial and temporal resolution were suitable for studying the articulatory move- 
ments of speech in a sagittal projection image. Motion picture X-ray films collected 
for various speech utterances could be analyzed frame by frame to track the changing 
positions of articulators and the time-varying configuration of the vocal tract outline. 
Just as the instrumentation that allowed scientists to see waveforms and spectrograms 
had motivated earlier forms of synthetic speech, the ability to now see the movement 
of the articulators motivated development of a new type of synthesis paradigm called 
“articulatory synthesis.” 

In 1967, Cecil Coker of Bell Laboratories demonstrated a synthesis system based on a 
computational model of the speech articulators. Simplified positions of the tongue, jaw, 
lips, velum, and larynx were represented in the midsagittal plane, where each could 
be specified to move with a particular timing function. The result was a time-varying 
configuration of the midsagittal vocal tract outline. To produce the speech waveform, 
the distances across the vocal tract airspace from glottis to lips at each time sample first 
needed to be converted to cross-sectional areas to form the area function (cf., Heinz and 
Stevens, 1964). These were then used in a vocal tract analog model like that of Kelly and
Lochbaum (1962) to calculate wave propagation through the system. The resonance 
frequencies could also be calculated directly from the time-varying area function and 
used to control a formant synthesizer (Coker and Fujimura, 1966). Similar articulatory 
models were developed by Lindblom and Sundberg (1971) and Mermelstein (1973), 
but incorporated somewhat more complexity in the articulatory geometry. In any case, 
the temporal characteristics of the synthesized articulatory movement could be com- 
pared to, and refined with, data extracted from midsagittal cineradiography films (e.g., 
Truby, 1965). An articulatory synthesizer developed at Haskins Laboratories called 
“ASY” (Rubin, Baer, and Mermelstein, 1981), extended the Mermelstein model, incor- 
porating several additional sub-models and an approach based on key frame animation 
for synthesizing utterances derived from control of the movement over time of the vo- 
cal tract. This was one of the earliest articulatory synthesis tools used for large-scale 
laboratory phonetic experiments (e.g., Abramson et al., 1981). It was later enhanced 
to provide more accurate representations of the underlying vocal tract parameters and 
flexibility in their control by a user (the Haskins Configurable Articulatory Synthesizer, 
or CASY; see Rubin et al., 1996). 

Articulatory synthesis held much promise because it was assumed that the rules re- 
quired to generate speech would be closer to those used by a human talker than rules 
developed for controlling acoustic parameters such as formant frequencies. While that 
may ultimately be the case, such rules have been difficult to define in such a way that 
natural sounding, intelligible speech is consistently generated (Klatt, 1987). Articula- 
tory synthesizers have become important tools for research, however, because they can 
serve as a model of speech production in which the acoustic consequences of parametric 
variation of an articulator can be investigated. Using the ASY synthesizer (Rubin, Baer, 
and Mermelstein, 1981) to produce speech output, Browman et al. (1984), Browman 
and Goldstein (e.g., 1985, 1991), Saltzman (1986, 1991), and others at Haskins Labora- 
tories embarked on research to understand articulatory control. Guiding this work was 
the hypothesis that phonetic structure could be characterized explicitly as articulatory 
movement patterns, or “gestures.” Their system allowed for specification of an utterance 
as a temporal schedule of “tract variables,” such as location and degree of a constriction 
formed by the tongue tip or tongue body, lip aperture, and protrusion, as well as states 
of the velum and glottis. These were then transformed into a task-dynamic system that 
accounted for the coordination and dynamic linkages among articulators required to 
carry out a specified gesture. Over the years, techniques for estimating the time course 
of gesture specification have continued to be enhanced. Recently, for example, Nam et 
al. (2012) reported a method for estimating gestural “scores” from the acoustic signal
based on an iterative analysis-by-synthesis approach. 

Some developers of articulatory synthesis systems focused their efforts on particular 
subsystems such as the voice source. For example, Flanagan and Landgraf (1968) pro- 
posed a simple mass-spring-damper model of the vocal folds that demonstrated, with 
a computational model, the self-oscillating nature of the human vocal folds. Shortly 
thereafter, a more complex two-mass version was described (Ishizaka and Flanagan, 
1972; Ishizaka and Matsudaira, 1972) that clearly showed the importance of the vertical 
phase difference (mucosal wave) in facilitating vocal fold oscillation. Additional degrees 
of freedom were added to the anterior-posterior dimension of the vocal folds by Titze 
(1973, 1974) with a 16-mass model. Although the eventual goal of subsystem modeling
would be integration into a full speech synthesis system (cf., Flanagan, Ishizaka,
and Shipley, 1975; Sondhi and Schroeter, 1987), much of the value of such models is as
tools for understanding the characteristics of the subsystem itself. Maeda (1988, 1990), Dang and
Honda (2004), and Birkholz (2013) are all examples of more recent attempts to inte- 
grate models of subsystems into an articulatory speech synthesizer, whereas Guenther 
(cf., 1994) and Kröger et al. (2010) have augmented such synthesizers with auditory
feedback and learning algorithms. The main use of these systems has been to study 
some aspect of speech production, but not necessarily the conversion of a symbolic rep- 
resentation of an utterance into speech. It can also be noted that a natural extension of 
articulatory synthesis is the inclusion of facial motion that coincides with speaking, which
has led to the development of audiovisual synthetic speech systems that can be used to explore
the multimodal nature of both speech production and perception (cf., Yehia, Rubin, and
Vatikiotis-Bateson, 1998; Massaro, 1998; Vatikiotis-Bateson et al., 2000). This area will 
be covered in the chapter entitled “New horizons in clinical phonetics.” 

Other researchers focused on enhancing vocal tract analog models without
adding the complexity of articulatory components. Strube (1982), Liljencrants (1985), 
and Story (1995) all refined the digital lattice filter approach of Kelly and Lochbaum 
(1962) to better account for the acoustic properties of time-varying vocal tract shapes 
and various types of energy losses. Based on the relation of small perturbations of a 
tubular configuration to changes in the acoustic resonance frequencies, Mrayati, Carré, 
and Guérin (1988) proposed that speech could be produced by controlling the time- 
varying cross-sectional area of eight distinct regions of the vocal tract. The idea was 
that expansion or constriction of these particular regions would maximize the change in 
resonance frequencies, thus providing an efficient means of controlling the vocal tract 
to generate a predictable acoustic output. Some years later a set of rules was developed 
by Hill, Manzara, and Schock (1995) that specified the transformation of text input into
region parameters and was used to build a vocal tract-based text-to-speech synthesizer.
In a similar vein, Story (2005, 2013) has described an “airway modulation model” of 
speech production (called “TubeTalker”) in which an array of functions can be activated 
over time to deform the vocal tract, nasal tract, and glottal airspaces to produce speech. 
Although this model produces highly intelligible speech, its primary use is for studying 
the relation of structure and movement to the acoustic characteristics produced, and 
the perceptual response of listeners (cf., Story and Bunton, 2010).
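
The tube-based view of the vocal tract that underlies the Kelly and Lochbaum approach can be illustrated with a short Python sketch. It computes the reflection coefficient at each junction between adjacent tube sections and applies the scattering relations at a single junction. The sketch assumes lossless sections and pressure waves, and the function names `reflection_coefficients` and `scatter` are hypothetical; it is not the implementation used by any of the authors cited above.

```python
def reflection_coefficients(areas):
    """Pressure-wave reflection coefficient at each junction between
    adjacent tube sections, given cross-sectional areas (e.g., in cm^2)
    ordered from glottis to lips. Assumes lossless sections."""
    return [(a - b) / (a + b) for a, b in zip(areas, areas[1:])]


def scatter(forward_in, backward_in, r):
    """Scattering at one lossless junction with reflection coefficient r.

    forward_in arrives from the glottal side, backward_in from the lip
    side. Returns (forward_out, backward_out), the waves leaving the
    junction toward the lips and toward the glottis, respectively.
    """
    forward_out = (1.0 + r) * forward_in - r * backward_in
    backward_out = r * forward_in + (1.0 - r) * backward_in
    return forward_out, backward_out


# Example with made-up areas: a section of 2 cm^2 narrowing to 1 cm^2.
r = reflection_coefficients([2.0, 1.0])[0]
forward_out, backward_out = scatter(1.0, 0.0, r)
```

With these conventions, a uniform tube yields coefficients of zero (waves pass through unchanged), whereas a narrowing toward the lips yields a positive coefficient, so part of the forward-going pressure wave is reflected back toward the glottis.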

In parallel with development of articulatory-type synthesizers was the enhancement 
of resonance or formant-based synthesis systems. Along with numerous colleagues, 
Dennis Klatt’s research on digital resonators, as well as his studies on the acoustic
characteristics of nearly all aspects of speech, led to a comprehensive system of rule-based
formant synthesis. With various names such as “Klattalk,” “MITalk,” “DecTalk,” and 
later “KLSYN88,” this type of text-to-speech system has become well known to the 
public, particularly because of its collection of standard voices (cf., Klatt, 1982, 1987; 
Klatt and Klatt, 1990). Perhaps best known today is “Perfect Paul,” the voice that was 
synonymous with the late British physicist Stephen Hawking who used the synthesizer 
as an augmentative speaking device. Formant synthesis can also be combined with ar- 
ticulatory methods. “HLSyn,” developed by Hanson and Stevens (2002), is a system 
designed to superimpose high level (HL) articulatory control on the Klatt formant syn- 
thesizer. This approach simplified the control scheme by mapping 13 physiologically 
based HL parameters to the 40-50 acoustic parameters that control the formant syn- 
thesis. The advantage is that the articulatory parameters constrain the output so that 
physiologically unrealistic combinations of the voice source and vocal tract filter cannot 
occur. This type of synthesizer can serve as another tool for studying speech production 
with regard to both research and educational purposes. 
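
The digital resonator on which formant synthesis is built can be sketched as a second-order difference equation whose coefficients are set by a formant frequency and bandwidth. The Python below is a minimal sketch of that commonly described form. The function names, the cascade arrangement, and the example formant values are assumptions made for illustration and are not taken from any particular synthesizer.

```python
import math


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    return a, b, c


def resonate(signal, freq_hz, bandwidth_hz, sample_rate):
    """Pass a list of samples through one formant resonator."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate)
    y1 = y2 = 0.0  # previous two output samples
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def cascade(signal, formants, sample_rate):
    """Apply resonators in series; formants is a list of (Hz, bandwidth)."""
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Illustrative values only: three formants of a neutral vowel,
# excited by a single impulse at a 10 kHz sampling rate.
impulse = [1.0] + [0.0] * 999
vowel_like = cascade(impulse, [(500, 60), (1500, 90), (2500, 120)], 10000)
```

In a rule-based system, the formant frequencies and bandwidths passed to such resonators would be updated over time according to the rules, rather than held constant as in this example.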

Throughout this chapter, it has been presumed that regardless of the reasons for 
developing a particular type of synthesizer, at some level, the goal was to generate 
high-quality, intelligible speech. Some synthesizers have been developed, however, for 
the explicit purpose of degrading or modifying natural, recorded speech. Such synthe- 
sizers are useful for investigating speech perception because they allow researchers to 
systematically remove many of the acoustic characteristics present in the signal while 
preserving only those portions hypothesized to be essential cues. Remez et al. (1981) 
described a synthesis technique in which the first three formant frequencies, tracked
over the duration of a sentence, were replaced by the summation of three tones whose
frequencies were swept upward and downward to match the temporal variation of the 
formants. Although the quality of the synthesized sound was highly artificial (perhaps
otherworldly), listeners were able to identify the sentences as long as the tones
were played simultaneously, and not in isolation of one another, revealing the power 
of speech cues that are embedded in the dynamic spectral patterns of the vocal tract 
resonances. Shannon et al. (1995) showed that intelligible speech could alternatively 
be synthesized by preserving temporal cues, while virtually eliminating spectral infor- 
mation. Their approach was essentially the same as Dudley’s Vocoder (1939) in which 
the speech signal was first filtered into a set of frequency bands, and time-varying am- 
plitude envelopes were extracted from each band over the duration of the recorded 
speech. The difference was that the number of bands ranged from only one to four, 
and the amplitude envelopes were used to modulate a noise signal rather than an estimation
of the voice source. Shannon et al. showed that listeners were adept at decoding
sentence-level speech with only three bands of modulated noise. Similarly designed 
synthesizers (e.g., Loizou, Dorman, and Tu, 1999) have been used to simulate the signal
processing algorithms in cochlear implant devices for purposes of investigating speech 
perception abilities of listeners under these conditions. Yet another variation on this 
type of synthesis was reported by Smith, Delgutte, and Oxenham (2002) who devel- 
oped a technique to combine the spectral fine structure of one type of sound with the 
temporal variation of another to generate “auditory chimeras.” These have been shown 
to be useful for investigating aspects of auditory perception. 
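
The tone-replacement idea described above for Remez et al. (1981) can be sketched in a few lines of Python: each formant track drives one sinusoid whose frequency follows the track, and the sinusoids are summed. This is only an illustrative sketch. The frame-based track format, the linear interpolation between frames, the constant per-tone gains, and the name `sine_wave_speech` are assumptions, and the formant values in the example are made up rather than measured from speech.

```python
import math


def sine_wave_speech(formant_tracks, gains, frame_rate, sample_rate):
    """Sum one frequency-swept tone per formant track.

    formant_tracks: list of tracks, each a list of frequencies in Hz
                    with one value per analysis frame.
    gains:          one constant linear gain per track (a simplification).
    frame_rate:     frames per second; assumed to divide sample_rate.
    """
    n_frames = len(formant_tracks[0])
    samples_per_frame = sample_rate // frame_rate
    n_samples = (n_frames - 1) * samples_per_frame
    phases = [0.0] * len(formant_tracks)
    out = []
    for n in range(n_samples):
        frame, offset = divmod(n, samples_per_frame)
        frac = offset / samples_per_frame
        sample = 0.0
        for k, track in enumerate(formant_tracks):
            # Interpolate the tone frequency between neighboring frames.
            freq = (1.0 - frac) * track[frame] + frac * track[frame + 1]
            # Accumulate phase so the tone stays continuous as it sweeps.
            phases[k] += 2.0 * math.pi * freq / sample_rate
            sample += gains[k] * math.sin(phases[k])
        out.append(sample)
    return out


# Made-up tracks: three frames at 100 frames per second, 8 kHz sampling.
tracks = [[500, 700, 600], [1500, 1200, 1800], [2500, 2500, 2600]]
tones = sine_wave_speech(tracks, [1.0, 0.5, 0.25], 100, 8000)
```

Setting all but one gain to zero gives a single tone in isolation, which corresponds to the condition under which listeners could not identify the sentences.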

Many other types of speech synthesis methods have been developed in the digital 
era whose primary purpose is to generate high quality speech for automated messaging 
or to be embodied in a digital assistant that converses with a user. These systems typically
make use of synthesis techniques that build speech signals from information available 
in a database containing many hours of recordings of one or more voice professionals 
who produced a wide range of spoken content and vocal qualities. The “unit selection” 
technique, also referred to as “concatenative synthesis,” is essentially the digital realiza- 
tion of the tape splicing method of Harris (1953b) and Peterson, Wang, and Sivertsen 
(1958), but now involves a set of algorithms that efficiently search the database for small 
sound segments, typically at the level of diphones, that can be stacked serially in time 
to generate a spoken message. A different technique, called “parametric synthesis,” 
relies on an extensive analysis of the spectral characteristics of speech recordings in a 
database to establish parametric representations that can later be used to reconstruct a 
segment of speech (Zen, Tokuda, and Black, 2009). Unit selection typically produces 
more natural sounding speech but is limited by the quality and size of the original 
database. Parametric synthesis allows for greater flexibility with regard to modification
of voice characteristics, speaking style, and emotional content, but generally is of
lower overall quality. Both techniques have been augmented with implementation of 
deep learning algorithms that improve the efficiency and accuracy of constructing a 
spoken utterance, as well as increasing the naturalness and intelligibility of the syn- 
thetic speech (Zen, Senior, and Schuster, 2013; Capes et al., 2017). More recently, 
a new approach called direct waveform modeling has been introduced that utilizes a 
deep neural network (DNN) to generate new speech signals based on learned features 
of recorded speech (cf., van den Oord et al., 2016; Arik et al., 2017). This method has 
the potential to significantly enhance the quality and naturalness of synthetic speech 
over current systems, even though it is currently computationally expensive. It can be 
noted, however, that because unit selection, parametric, and direct waveform synthe- 
sizers construct speech signals based on underlying principles that are not specifically 
related to the ways in which a human forms speech, they are perhaps less useful as a 
tool for testing hypotheses about speech production and perception than many of the 
other techniques discussed in this chapter. 
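
The database search at the heart of unit selection can be illustrated with a small dynamic-programming sketch in Python. The text above does not specify a search procedure, so the formulation here is an assumption: each candidate unit is scored with a "target cost" (how well it matches the requested unit) and a "join cost" (how smoothly it connects to its predecessor), and the cheapest sequence is found with a Viterbi-style search. The unit representation, the cost functions, and all names are hypothetical.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Choose one candidate per target so that total cost is minimal.

    targets:    list of requested units (e.g., diphone labels).
    candidates: list of candidate lists, one per target position.
    """
    # best[i][j] = (cheapest cost of a path ending at candidates[i][j],
    #               index of the previous candidate on that path)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            prev_cost, prev_j = min(
                (best[i - 1][j][0] + join_cost(p, c), j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + target_cost(targets[i], c), prev_j))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]


# Toy database: (label, pitch at start in Hz, pitch at end in Hz).
targets = ["h-e", "e-l", "l-o"]
candidates = [
    [("h-e", 100, 120), ("h-e", 100, 150)],
    [("e-l", 150, 140), ("e-l", 118, 110)],
    [("l-o", 112, 100), ("l-o", 160, 150)],
]
chosen = select_units(
    targets,
    candidates,
    target_cost=lambda label, unit: 0 if unit[0] == label else 10,
    join_cost=lambda prev, cur: abs(prev[2] - cur[1]),
)
```

In this toy example the search favors the sequence with the smallest pitch discontinuities at the joins, which hints at why the quality and size of the database limit what such a system can produce.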


Summary 


For centuries, past to present, humans have been motivated to build machines that 
talk. Other than the novelty, what is the purpose of speech synthesis, and what can be 
done with it? Certainly, technological applications have resulted from development of 
these devices, many of them having a major impact on how humans communicate with 
each other. Mattingly (1974) seems to have hit it just about right when he suggested that 
the “traditional motivation for research in speech synthesis” has been simply the desire 
to explain the mystery of how we humans successfully use our vocal tracts to produce 
connected speech. In other words, the primary means of scientifically investigating 
speech production has been based on building artificial talking systems and collecting 
relevant data with which to refine them. Mattingly (1974) also points out that, re- 
gardless of the underlying principles of the synthetic speech system built, the scientific 
questions are almost always concerned with deriving the “rules” that govern produc- 
tion of intelligible speech. Such rules may be elaborate and explicitly stated algorithms 
for transforming a string of text into speech based on a particular type of synthesizer, 
or more subtly implied as general movements of structures or acoustic characteristics. 
In any case, achieving an understanding of the rules and all their variations can be
regarded as synonymous with understanding many aspects of speech production and 
perception. As in any area of science, the goal in studying speech has been to first de- 
termine the important facts about the system. Artificial talkers and speech synthesizers 
embody these “facts,” but they typically capture just the essential aspects of speech. As
a result, synthesis often presents itself as an aural caricature that can be perceived as an 
unnatural, and sometimes amusing, rendition of a desired utterance or speech sound.
It is unique to phonetics and speech science that the models used as tools
to understand the scientific aspects of a complex system produce a signal intended to 
be heard as if it were a human. As such, the quality of speech synthesis can be rather 
harshly judged because the model on which it is based has not accounted for the myriad 
of subtle variations and details that combine in natural human speech. Thus, we should 
keep in mind that the degree to which we can produce convincing artificial speech is a 
measure of the degree to which we understand human speech production. 